The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
Kaggle is a Data Science Competition Platform which shares a lot of datasets. In the past, it was troublesome to submit your result as your have to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to setup, it takes me less than 15 minutes to finish a submission.
kaggle.json filekaggle.json in the right placeFor more detailed information on setting the Kaggle API see here and here.
!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/lib/python3.9/site-packages (1.5.12) Requirement already satisfied: tqdm in /usr/local/lib/python3.9/site-packages (from kaggle) (4.62.3) Requirement already satisfied: python-slugify in /usr/local/lib/python3.9/site-packages (from kaggle) (5.0.2) Requirement already satisfied: urllib3 in /usr/local/lib/python3.9/site-packages (from kaggle) (1.26.7) Requirement already satisfied: python-dateutil in /usr/local/lib/python3.9/site-packages (from kaggle) (2.8.2) Requirement already satisfied: certifi in /usr/local/lib/python3.9/site-packages (from kaggle) (2021.10.8) Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.9/site-packages (from kaggle) (1.15.0) Requirement already satisfied: requests in /usr/local/lib/python3.9/site-packages (from kaggle) (2.26.0) Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.9/site-packages (from python-slugify->kaggle) (1.3) Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/local/lib/python3.9/site-packages (from requests->kaggle) (2.0.4) Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/site-packages (from requests->kaggle) (3.3) WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv WARNING: You are using pip version 21.3.1; however, version 23.0.1 is available. You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
!pwd
/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2
!pwd
/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2
!ls -l /root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/kaggle.json
-rw-r--r-- 1 root root 63 Apr 10 22:36 /root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/kaggle.json
# !mkdir /.kaggle
!cp kaggle.json ~/.kaggle
# # !chmod 600 ~/.kaggle/kaggle.json
! kaggle competitions files home-credit-default-risk
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /root/.kaggle/kaggle.json' Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.13 / client 1.5.12) name size creationDate ---------------------------------- ----- ------------------- previous_application.csv 386MB 2019-12-11 02:55:35 credit_card_balance.csv 405MB 2019-12-11 02:55:35 installments_payments.csv 690MB 2019-12-11 02:55:35 HomeCredit_columns_description.csv 37KB 2019-12-11 02:55:35 application_train.csv 158MB 2019-12-11 02:55:35 bureau_balance.csv 358MB 2019-12-11 02:55:35 application_test.csv 25MB 2019-12-11 02:55:35 bureau.csv 162MB 2019-12-11 02:55:35 POS_CASH_balance.csv 375MB 2019-12-11 02:55:35 sample_submission.csv 524KB 2019-12-11 02:55:35
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including United States, Russia, Kazahstan, Belarus, China, India) and focuses on lending primarily to people with little or no credit history which will either not obtain loans or became victims of untrustworthly lenders.
Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China (as of 19-05-2018).
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
The HomeCredit_columns_description.csv acts as a data dictioanry.
There are 7 different sources of data:
name [ rows cols] MegaBytes
----------------------- ------------------ -------
application_train : [ 307,511, 122]: 158MB
application_test : [ 48,744, 121]: 25MB
bureau : [ 1,716,428, 17] 162MB
bureau_balance : [ 27,299,925, 3]: 358MB
credit_card_balance : [ 3,840,312, 23] 405MB
installments_payments : [ 13,605,401, 8] 690MB
previous_application : [ 1,670,214, 37] 386MB
POS_CASH_balance : [ 10,001,358, 8] 375MB
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:
Download button on the following Data Webpage and unzip the zip file to the BASE_DIRDATA_DIR = "/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2" #same level as course repo in the data directory
#DATA_DIR = os.path.join('./ddddd/')
!mkdir DATA_DIR
mkdir: cannot create directory ‘DATA_DIR’: File exists
!ls -l DATA_DIR
total 2621364 -rw-r--r-- 1 root root 37383 Apr 10 23:05 HomeCredit_columns_description.csv -rw-r--r-- 1 root root 392703158 Apr 10 23:05 POS_CASH_balance.csv -rw-r--r-- 1 root root 26567651 Apr 10 23:05 application_test.csv -rw-r--r-- 1 root root 166133370 Apr 10 23:05 application_train.csv -rw-r--r-- 1 root root 170016717 Apr 10 23:05 bureau.csv -rw-r--r-- 1 root root 375592889 Apr 10 23:05 bureau_balance.csv -rw-r--r-- 1 root root 424582605 Apr 10 23:05 credit_card_balance.csv -rw-r--r-- 1 root root 723118349 Apr 10 23:06 installments_payments.csv -rw-r--r-- 1 root root 404973293 Apr 10 23:06 previous_application.csv -rw-r--r-- 1 root root 536202 Apr 10 23:06 sample_submission.csv
! kaggle competitions download home-credit-default-risk -p $DATA_DIR --force
Downloading home-credit-default-risk.zip to /root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2 100%|███████████████████████████████████████▉| 688M/688M [01:26<00:00, 7.51MB/s] 100%|████████████████████████████████████████| 688M/688M [01:26<00:00, 8.36MB/s]
# !chmod 600 /root/.kaggle/kaggle.json
!pwd
/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2
!ls -l $DATA_DIR
total 714032 drwxr-xr-x 1 root root 512 Apr 10 23:01 Coursesv2 drwxr-xr-x 1 root root 512 Apr 10 23:08 DATA_DIR -rwxrwxrwx 1 root root 7014975 Apr 11 21:56 HCDR_baseLine_submission_with_numerical_and_cat_features_to_kaggle.ipynb -rwxrwxrwx 1 root root 11 Feb 21 05:06 Phase2.md -rw-r--r-- 1 root root 721616255 Apr 11 21:57 home-credit-default-risk.zip -rwxrwxrwx 1 root root 66899 Feb 21 05:06 home_credit.png -rw-r--r-- 1 root root 63 Apr 10 22:36 kaggle.json -rwxrwxrwx 1 root root 1368981 Feb 21 05:06 submission.csv -rwxrwxrwx 1 root root 1091396 Feb 21 05:06 submission.png
#!rm -r DATA_DIR
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
unzippingReq = True #True
if unzippingReq: #please modify this code
zip_ref = zipfile.ZipFile(f'{DATA_DIR}/home-credit-default-risk.zip', 'r')
# extractall(): Extract all members from the archive to the current working directory. path specifies a different directory to extract to
zip_ref.extractall('DATA_DIR')
zip_ref.close()
ls -l ../../../Data/home-credit-default-risk/application_train.csv
ls: cannot access '../../../Data/home-credit-default-risk/application_train.csv': No such file or directory
!ls -l Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/DATA_DIR/application_train.csv
ls: cannot access 'Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/DATA_DIR/application_train.csv': No such file or directory
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
def load_data(in_path, name):
df = pd.read_csv(in_path)
print(f"{name}: shape is {df.shape}")
print(df.info())
display(df.head(5))
return df
datasets={} # lets store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
DATA_DIR = "/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/DATA_DIR"
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
DATA_DIR
'/root/shared/Coursesv2/AML526/I526_AML_Student/Assignments/Unit-Project-Home-Credit-Default-Risk/Phase2/DATA_DIR'
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: Gender, income, family status, education ...
%%time
ds_names = ("application_train", "application_test", "bureau","bureau_balance","credit_card_balance","installments_payments",
"previous_application","POS_CASH_balance")
for ds_name in ds_names:
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB None
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 30.7 s, sys: 11.2 s, total: 42 s Wall time: 52.3 s
for ds_name in datasets.keys():
print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train : [ 307,511, 122] dataset application_test : [ 48,744, 121] dataset bureau : [ 1,716,428, 17] dataset bureau_balance : [ 27,299,925, 3] dataset credit_card_balance : [ 3,840,312, 23] dataset installments_payments : [ 13,605,401, 8] dataset previous_application : [ 1,670,214, 37] dataset POS_CASH_balance : [ 10,001,358, 8]
datasets["application_train"].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
datasets["application_train"].describe() #numerical only features
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
datasets["application_test"].describe() #numerical only features
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_train"].describe(include='all') #look at all categorical and numerical
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
datasets["application_test"].describe(include='all') #look at all categorical and numerical
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744 | 48744 | 48744 | 48744 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| unique | NaN | 2 | 2 | 2 | 2 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | 48305 | 32678 | 32311 | 33658 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 277796.676350 | NaN | NaN | NaN | NaN | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | NaN | NaN | NaN | NaN | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | NaN | NaN | NaN | NaN | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | NaN | NaN | NaN | NaN | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
11 rows × 121 columns
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
Visualizing the missing data for the application train data based on the above analysis.We are creating a horizontal bar plot to visualize the top 60 variables with the highest percentage of missing values in the missing_application_train_data dataset.
pip install missingno
Requirement already satisfied: missingno in /usr/local/lib/python3.9/site-packages (0.5.2) Requirement already satisfied: scipy in /usr/local/lib/python3.9/site-packages (from missingno) (1.7.3) Requirement already satisfied: seaborn in /usr/local/lib/python3.9/site-packages (from missingno) (0.11.2) Requirement already satisfied: matplotlib in /usr/local/lib/python3.9/site-packages (from missingno) (3.4.3) Requirement already satisfied: numpy in /usr/local/lib/python3.9/site-packages (from missingno) (1.22.0) Requirement already satisfied: pyparsing>=2.2.1 in /usr/local/lib/python3.9/site-packages (from matplotlib->missingno) (3.0.6) Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.9/site-packages (from matplotlib->missingno) (0.11.0) Requirement already satisfied: python-dateutil>=2.7 in /usr/local/lib/python3.9/site-packages (from matplotlib->missingno) (2.8.2) Requirement already satisfied: pillow>=6.2.0 in /usr/local/lib/python3.9/site-packages (from matplotlib->missingno) (9.0.0) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.9/site-packages (from matplotlib->missingno) (1.3.2) Requirement already satisfied: pandas>=0.23 in /usr/local/lib/python3.9/site-packages (from seaborn->missingno) (1.3.5) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.9/site-packages (from pandas>=0.23->seaborn->missingno) (2021.3) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.9/site-packages (from python-dateutil>=2.7->matplotlib->missingno) (1.15.0) WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv WARNING: You are using pip version 21.3.1; however, version 23.0.1 is available. You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command. Note: you may need to restart the kernel to use updated packages.
import missingno as missingnum
# missingnum.matrix(datasets["application_train"], color=(0.2980392156862745, 0.4470588235294118, 0.6901960784313725))
<AxesSubplot:>
#providing figure size details
plt.figure(figsize=(17, 8))
#sorting values to be plotted
missing_application_train_data['Percent'].sort_values().tail(60).plot.barh(figsize=(15, 30), color='pink')
plt.title("Top 60 variables with the highest percentage of missing values in the application train data")
plt.grid(True)
plt.show();
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| Percent | Test Missing Count | |
|---|---|---|
| COMMONAREA_AVG | 68.72 | 33495 |
| COMMONAREA_MODE | 68.72 | 33495 |
| COMMONAREA_MEDI | 68.72 | 33495 |
| NONLIVINGAPARTMENTS_AVG | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MODE | 68.41 | 33347 |
| NONLIVINGAPARTMENTS_MEDI | 68.41 | 33347 |
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| LIVINGAPARTMENTS_AVG | 67.25 | 32780 |
| LIVINGAPARTMENTS_MODE | 67.25 | 32780 |
| LIVINGAPARTMENTS_MEDI | 67.25 | 32780 |
| FLOORSMIN_MEDI | 66.61 | 32466 |
| FLOORSMIN_AVG | 66.61 | 32466 |
| FLOORSMIN_MODE | 66.61 | 32466 |
| OWN_CAR_AGE | 66.29 | 32312 |
| YEARS_BUILD_AVG | 65.28 | 31818 |
| YEARS_BUILD_MEDI | 65.28 | 31818 |
| YEARS_BUILD_MODE | 65.28 | 31818 |
| LANDAREA_MEDI | 57.96 | 28254 |
| LANDAREA_AVG | 57.96 | 28254 |
| LANDAREA_MODE | 57.96 | 28254 |
We are creating a horizontal bar plot to visualize the top 60 variables with the highest percentage of missing values in the missing_application_test_data dataset.
plt.figure(figsize=(17, 7))
missing_application_test_data['Percent'].sort_values().tail(60).plot.barh(figsize=(17, 35), color='purple')
plt.title("Top 60 variables with the highest percentage of missing values in the application test data")
plt.grid(True)
plt.show();
Observations based on above plots: -Large portion of the data is missing , as we can see from the above plots.
# Saving the training and test data sets
ap_train_data= datasets["application_train"]
ap_test_data = datasets["application_test"]
# Determining the categorical and numerical features
numerical_features = ap_train_data.select_dtypes(include = ['int64', 'float64']).columns
categorical_features = ap_train_data.select_dtypes(include = ['object', 'bool']).columns
print(f"\nNumerical features : {list(numerical_features)}")
print(f"\nCategorical features : {list(categorical_features)}")
Numerical features : ['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'TOTALAREA_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR'] Categorical features : ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
len(list(numerical_features))
106
len(list(categorical_features))
16
total_ip_features = len(list(numerical_features)) + len(list(categorical_features))
total_ip_features
122
#Distributin of target
import matplotlib.pyplot as plt
%matplotlib inline
datasets["application_train"]['TARGET'].astype(int).plot.hist();
Observation from the above plot- 1.It is evident from the above plot,we can infer that greater than 250000 people have paid back the loan on time.
To comprehend the distribution of the binary variable TARGET, that indicates whether or not a loan application failed on their loan. We can determine the class balance or imbalance in the dataset by counting the number of instances in each class.
ap_train_data['TARGET'].value_counts()
0 282686 1 24825 Name: TARGET, dtype: int64
Observation from the above plot 2. The above count confirms our inference of people paying greater than 250000 , a total of 282686 people paid there loans.
plt.figure(figsize=(7, 7))
plt.pie(x=ap_train_data['TARGET'].value_counts(),
radius=1.5-0.5,
labels=ap_train_data['TARGET'].value_counts().index,
autopct='%1.1f%%',
colors=['fuchsia', 'yellow'],
explode=[0,0.5],
wedgeprops={"edgecolor":"0", "width":0.4},
startangle=160,
shadow=True,
textprops={'fontsize': 14})
plt.ylabel('TARGET', fontsize=15)
plt.title('Plot for distribution of TARGET feature', fontsize=17)
plt.show()
Observation from the above plota.The TARGET feature exhibits a significant amount of unbalance, as can be observed,because of this, measuring the accuracy performance metric will be difficult.
correlations = datasets["application_train"].corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(10))
print('\nMost Negative Correlations:\n', correlations.head(10))
Most Positive Correlations: FLAG_DOCUMENT_3 0.044346 REG_CITY_NOT_LIVE_CITY 0.044395 FLAG_EMP_PHONE 0.045982 REG_CITY_NOT_WORK_CITY 0.050994 DAYS_ID_PUBLISH 0.051457 DAYS_LAST_PHONE_CHANGE 0.055218 REGION_RATING_CLIENT 0.058899 REGION_RATING_CLIENT_W_CITY 0.060893 DAYS_BIRTH 0.078239 TARGET 1.000000 Name: TARGET, dtype: float64 Most Negative Correlations: EXT_SOURCE_3 -0.178919 EXT_SOURCE_2 -0.160472 EXT_SOURCE_1 -0.155317 DAYS_EMPLOYED -0.044932 FLOORSMAX_AVG -0.044003 FLOORSMAX_MEDI -0.043768 FLOORSMAX_MODE -0.043226 AMT_GOODS_PRICE -0.039645 REGION_POPULATION_RELATIVE -0.037227 ELEVATORS_AVG -0.034199 Name: TARGET, dtype: float64
Observations from the above data:
#creating the correlation for the training data
tr_data_corr=ap_train_data.corr()
plt.figure(figsize=(27, 17))
sns.heatmap(tr_data_corr, cmap='viridis')
plt.title("Correlation map for application traininig data")
plt.plot();
Observations from the above data: -The heatmap is a bit ambigous to read as we have 122 columns to compare from.
# Extracting the AMOUNT variables
#Representing the correlations
amt_data = tr_data_corr[['TARGET', 'AMT_CREDIT', 'AMT_INCOME_TOTAL', 'AMT_GOODS_PRICE']]
amt_corr = amt_data.corr()
print(amt_corr)
TARGET AMT_CREDIT AMT_INCOME_TOTAL AMT_GOODS_PRICE TARGET 1.000000 -0.143445 -0.079451 -0.153852 AMT_CREDIT -0.143445 1.000000 0.355705 0.999502 AMT_INCOME_TOTAL -0.079451 0.355705 1.000000 0.358367 AMT_GOODS_PRICE -0.153852 0.999502 0.358367 1.000000
amt_corr
| TARGET | AMT_CREDIT | AMT_INCOME_TOTAL | AMT_GOODS_PRICE | |
|---|---|---|---|---|
| TARGET | 1.000000 | -0.143445 | -0.079451 | -0.153852 |
| AMT_CREDIT | -0.143445 | 1.000000 | 0.355705 | 0.999502 |
| AMT_INCOME_TOTAL | -0.079451 | 0.355705 | 1.000000 | 0.358367 |
| AMT_GOODS_PRICE | -0.153852 | 0.999502 | 0.358367 | 1.000000 |
plt.figure(figsize=(14, 8))
sns.heatmap(amt_corr, annot=True, cmap='PuRd')
plt.title("Heatmap for representing the AMOUNT correlations")
plt.plot();
Extract the EXT_SOURCE variables and Pair plot of the top 4 correlated features
ext_source_vars = ["EXT_SOURCE_3", "EXT_SOURCE_2", "EXT_SOURCE_1","DAYS_BIRTH"]
sns.pairplot(data = datasets["application_train"], hue="TARGET", vars = ext_source_vars, height=4, diag_kind="hist")
plt.title("Pair plots of the top 4 correlated features")
plt.show()
ext_source_vars_corr = tr_data_corr[ext_source_vars].corr()
ext_source_vars_corr
| EXT_SOURCE_3 | EXT_SOURCE_2 | EXT_SOURCE_1 | DAYS_BIRTH | |
|---|---|---|---|---|
| EXT_SOURCE_3 | 1.000000 | 0.261958 | 0.473936 | -0.474564 |
| EXT_SOURCE_2 | 0.261958 | 1.000000 | 0.476205 | -0.229976 |
| EXT_SOURCE_1 | 0.473936 | 0.476205 | 1.000000 | -0.847116 |
| DAYS_BIRTH | -0.474564 | -0.229976 | -0.847116 | 1.000000 |
plt.figure(figsize=(12, 9))
sns.heatmap(ext_source_vars_corr, annot=True, cmap='PuRd')
plt.title("Correlation map of top 4 features")
plt.plot();
Observation based on above
ext_source_vars_corr.isnull().count()
plt.hist(datasets["application_train"]['DAYS_BIRTH'] / -365, edgecolor = 'k', bins = 25)
plt.title('Age of Client'); plt.xlabel('Age (years)'); plt.ylabel('Count');
Observations from the above plot
The DAYS_BIRTH feature, which contains negative values, is used to calculate age. This is inconsistent and needs to be fixed.
When we plot age as a function of years, we observe a pretty normal distribution, which is encouraging in a challenging dataset where the DAYS BIRTH feature is substantially linked with the TARGET feature.
#Distribution based on gender,Applicants gender
gend = datasets["application_train"]['CODE_GENDER']
plt.figure(figsize=(10,7))
sns.countplot(data =datasets["application_train"], x= datasets["application_train"]['TARGET'], hue=gend, palette='rainbow')
plt.title("Distribution of target column based on Gender")
plt.show()
Observation from the above plot -Therefore it is evident from the count plot that women are obtaining more credit than men.
The purpose of this code is to group the 'YEARS_BIRTH' column into different age groups and analyze the distribution of 'TARGET' variable in each age group.
# creating different data frames for putting in age information
age_df = datasets["application_train"][['TARGET', 'DAYS_BIRTH']]
age_df['YEARS_BIRTH'] = age_df['DAYS_BIRTH'] / -365
# Bin the age data
age_df['GROUPED_YEARS_BIRTH'] = pd.cut(age_df['YEARS_BIRTH'], bins = np.linspace(20, 70, num = 11))
age_df.head(20)
| TARGET | DAYS_BIRTH | YEARS_BIRTH | GROUPED_YEARS_BIRTH | |
|---|---|---|---|---|
| 0 | 1 | -9461 | 25.920548 | (25.0, 30.0] |
| 1 | 0 | -16765 | 45.931507 | (45.0, 50.0] |
| 2 | 0 | -19046 | 52.180822 | (50.0, 55.0] |
| 3 | 0 | -19005 | 52.068493 | (50.0, 55.0] |
| 4 | 0 | -19932 | 54.608219 | (50.0, 55.0] |
| 5 | 0 | -16941 | 46.413699 | (45.0, 50.0] |
| 6 | 0 | -13778 | 37.747945 | (35.0, 40.0] |
| 7 | 0 | -18850 | 51.643836 | (50.0, 55.0] |
| 8 | 0 | -20099 | 55.065753 | (55.0, 60.0] |
| 9 | 0 | -14469 | 39.641096 | (35.0, 40.0] |
| 10 | 0 | -10197 | 27.936986 | (25.0, 30.0] |
| 11 | 0 | -20417 | 55.936986 | (55.0, 60.0] |
| 12 | 0 | -13439 | 36.819178 | (35.0, 40.0] |
| 13 | 0 | -14086 | 38.591781 | (35.0, 40.0] |
| 14 | 0 | -14583 | 39.953425 | (35.0, 40.0] |
| 15 | 0 | -8728 | 23.912329 | (20.0, 25.0] |
| 16 | 0 | -12931 | 35.427397 | (35.0, 40.0] |
| 17 | 0 | -9776 | 26.783562 | (25.0, 30.0] |
| 18 | 0 | -17718 | 48.542466 | (45.0, 50.0] |
| 19 | 0 | -11348 | 31.090411 | (30.0, 35.0] |
age_group_years = age_df.groupby('GROUPED_YEARS_BIRTH').mean()
age_group_years
| TARGET | DAYS_BIRTH | YEARS_BIRTH | |
|---|---|---|---|
| GROUPED_YEARS_BIRTH | |||
| (20.0, 25.0] | 0.123036 | -8532.795625 | 23.377522 |
| (25.0, 30.0] | 0.111436 | -10155.219250 | 27.822518 |
| (30.0, 35.0] | 0.102814 | -11854.848377 | 32.479037 |
| (35.0, 40.0] | 0.089414 | -13707.908253 | 37.555913 |
| (40.0, 45.0] | 0.078491 | -15497.661233 | 42.459346 |
| (45.0, 50.0] | 0.074171 | -17323.900441 | 47.462741 |
| (50.0, 55.0] | 0.066968 | -19196.494791 | 52.593136 |
| (55.0, 60.0] | 0.055314 | -20984.262742 | 57.491131 |
| (60.0, 65.0] | 0.052737 | -22780.547460 | 62.412459 |
| (65.0, 70.0] | 0.037270 | -24292.614340 | 66.555108 |
age_group_years['YEARS_BIRTH'].plot.barh(figsize=(15, 7), color='crimson')
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
plt.xlabel('Frequency', fontsize=12)
plt.ylabel('Age Group', fontsize=12)
plt.grid(True)
plt.title('Age Group Distribution (Years)', fontsize=17)
plt.show()
Observation from the above plot- Following binning, we see that older persons have a propensity to take out more loans than younger people.
plt.figure(figsize = (9, 9))
# Graph the age bins and the average of the target as a bar plot
plt.bar(age_group_years.index.astype(str), 100 * age_group_years['TARGET'],color='green')
# Label
plt.xticks(rotation = 85); plt.xlabel('Age Group in Years'); plt.ylabel('Repayment Percentage Failure ')
plt.title('Repayment Failure based on Age Group');
Observations from the above plot The Binning method is utilized to provide precise information regarding age.
Now, it can generally be deduced that the age range of 20 to 25 years is where the most majority of people who are essentially unable to essentially repay the loan on time for the most part reside. Contrary to popular opinion, a decline in the number of failures is typically observed as age increases.Younger applicants are actually more likely to default on the loan, showing that it can now be generally deduced that the majority of borrowers who normally are unable to return the loan on time type of fall into the age range of 20–25 years, which is actually quite substantial.Younger applicants are generally more likely to default on a loan, so banks should generally be advised against doing so. This proves that, contrary to popular belief, the age group of 20 to 25 years is actually where the majority of borrowers fail to repay loans in a timely manner.
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"]);
plt.title('Applicants Occupation');
plt.grid(True)
plt.xticks(rotation=90);
Observation based on plot Contrary to popular opinion, the majority of the candidates' occupations are studied here. It is generally accepted that laborers kind of tend to kind of take the loan most fundamentally followed by sales staff, demonstrating how this is generally accepted that laborers kind of tend to kind of take the loan most fundamentally followed by sales staff in a subtle way.
Is creating a visual representation of the distribution of applicants' occupation types
plt.figure(figsize=(20, 7))
sns.countplot(x='OCCUPATION_TYPE', data=datasets["application_train"], hue='TARGET', palette='rainbow')
plt.xlabel('Type Of Occupation', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xticks(rotation=80, fontsize=13)
plt.yticks(fontsize=11)
plt.title('Occupation of Applicant', fontsize=17)
plt.grid(True)
plt.plot();
Observation of the above plot
There are 18 distinct occupations represented among the borrowers, with laborers, salespeople, core employees, managers, and drivers at the top.
There is no clear pattern in the occupation classes whose borrowers effectively repay their loans.
Applicant Education Type v/s TARGET
ap_train_data['NAME_EDUCATION_TYPE'].value_counts()
Secondary / secondary special 218391 Higher education 74863 Incomplete higher 10277 Lower secondary 3816 Academic degree 164 Name: NAME_EDUCATION_TYPE, dtype: int64
By counting the number of occurrences of each unique value in this column, we can see how many applicants fall into each education category and get an idea of the overall educational background of the loan applicants in the dataset.
plt.figure(figsize=(15, 7))
sns.countplot(x='NAME_EDUCATION_TYPE', data=ap_train_data, palette='rainbow', hue='TARGET')
plt.xlabel('Type of Education', fontsize=15)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=50, fontsize=13)
plt.yticks(fontsize=11)
plt.title("Applicant's Education Type", fontsize=15)
plt.grid(True)
plt.plot();
Observation from the plot
Applicant Housing Type versus TARGET
ap_train_data['NAME_HOUSING_TYPE'].value_counts()
House / apartment 272868 With parents 14840 Municipal apartment 11183 Rented apartment 4881 Office apartment 2617 Co-op apartment 1122 Name: NAME_HOUSING_TYPE, dtype: int64
plt.figure(figsize=(17, 7))
sns.countplot(x='NAME_HOUSING_TYPE', data=ap_train_data, palette='rainbow', hue='TARGET')
plt.xlabel('Type of Housing', fontsize=15)
plt.ylabel('Count', fontsize=14)
plt.xticks(rotation=50, fontsize=13)
plt.yticks(fontsize=11)
plt.title("Applicant's Housing Type", fontsize=15)
plt.grid(True)
plt.plot();
Observation from the plot
Applicant Contract Type
ap_train_data['NAME_CONTRACT_TYPE'].value_counts()
Cash loans 278232 Revolving loans 29279 Name: NAME_CONTRACT_TYPE, dtype: int64
plt.figure(figsize=(8, 8))
plt.pie(x=ap_train_data['NAME_CONTRACT_TYPE'].value_counts(),
radius=1.5-0.5,
labels=ap_train_data['NAME_CONTRACT_TYPE'].value_counts().index,
autopct='%1.1f%%',
colors=['fuchsia', 'yellow'],
explode=[0,0.3],
wedgeprops={"edgecolor":"0", "width":0.5},
startangle=160,
shadow=True,
textprops={'fontsize': 15})
plt.ylabel('', fontsize=15)
plt.title("Applicant's Contract Type", fontsize=17)
plt.show()
Observation from the plot
Applicant Organization Type
plt.figure(figsize=(17, 30))
sns.countplot(y='ORGANIZATION_TYPE', data=ap_train_data,hue='TARGET',palette='rainbow')
plt.title("Type of Applicant's Organization", fontsize=15)
plt.xlabel('Organization Type', fontsize=15)
plt.ylabel('Count', fontsize=11)
plt.xticks(rotation=45, fontsize=12)
plt.yticks(fontsize=12)
plt.grid(True)
plt.plot();
Observation from the plot
Applicant's House Wall Material Type
ap_train_data['WALLSMATERIAL_MODE'].value_counts()
Panel 66040 Stone, brick 64815 Block 9253 Wooden 5362 Mixed 2296 Monolithic 1779 Others 1625 Name: WALLSMATERIAL_MODE, dtype: int64
plt.figure(figsize=(7, 7))
plt.pie(x=ap_train_data['WALLSMATERIAL_MODE'].value_counts(),
radius=1.5-0.5,
labels=ap_train_data['WALLSMATERIAL_MODE'].value_counts().index,
autopct='%1.1f%%',
colors=['fuchsia', 'yellow'],
wedgeprops={"edgecolor":"0", "width":0.5},
startangle=160,
shadow=True,
textprops={'fontsize': 15})
plt.ylabel('', fontsize=15)
plt.title("Applicant's House's Wall Material Type", fontsize=15)
plt.show()
Observation from the plot
Applicant's House Type Part 2
ap_train_data['HOUSETYPE_MODE'].value_counts()
block of flats 150503 specific housing 1499 terraced house 1212 Name: HOUSETYPE_MODE, dtype: int64
plt.figure(figsize=(9, 9))
plt.pie(x=ap_train_data['HOUSETYPE_MODE'].value_counts(),
radius=1.5-0.5,
labels=ap_train_data['HOUSETYPE_MODE'].value_counts().index,
autopct='%1.1f%%',
colors=['fuchsia', 'yellow', 'green'],
explode=[0,0.8,0.5],
wedgeprops={"edgecolor":"0", "width":0.3},
startangle=160,
shadow=True,
textprops={'fontsize': 14})
plt.ylabel('', fontsize=14)
plt.suptitle("Applicant's House Type", fontsize=16)
plt.show()
Observation from the plot
Applicant already own Realty?
ap_train_data['FLAG_OWN_REALTY'].value_counts()
Y 213312 N 94199 Name: FLAG_OWN_REALTY, dtype: int64
#plotting if teh applicant already own the realty
plt.figure(figsize=(7, 7))
plt.pie(x=ap_train_data['FLAG_OWN_REALTY'].value_counts(),
radius=1.5-0.5,
labels=ap_train_data['FLAG_OWN_REALTY'].value_counts().index,
autopct='%1.1f%%',
colors=['fuchsia', 'yellow'],
explode=[0,0.4],
startangle=55,
shadow=True,
textprops={'fontsize': 15})
plt.ylabel('', fontsize=15)
plt.title("Does the applicant already own a Realty?", fontsize=15)
plt.show()
Observation from the plot
plt.figure(figsize=(17, 8))
sns.countplot(x='FLAG_OWN_REALTY', data=ap_train_data, palette='rainbow', hue='TARGET')
plt.xlabel('Owns Realty? (No Default | Default)', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
plt.grid(True)
plt.title("If applicants owns realty versus Repayment", fontsize=15)
plt.plot();
Observation from the above plot:
-The majority of applicants in either class are not defaulters. Less than 25000 applicants have real estate and are behind on their payments.
Gender wise repayment analysis
plt.figure(figsize=(17, 7))
sns.countplot(x='CODE_GENDER', data=ap_train_data, palette='rainbow', hue='TARGET')
plt.xlabel('Gender (No Default | Default)', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xticks(fontsize=11)
plt.yticks(fontsize=11)
plt.grid(True)
plt.title("Applicants' Gender and Repayment relation", fontsize=15)
plt.plot();
Observations
Female applicants make up the majority, and the majority of them have no history of defaults.
In the case of male candidates, it is clear that a disproportionately higher percentage of applicants default.
### **AMT_CREDIT feature distribution plotting**
plt.figure(figsize=(17, 7))
sns.violinplot(x=ap_train_data['AMT_CREDIT'], palette='rainbow')
plt.xticks(size=11)
plt.yticks(size=11)
plt.xlabel('AMT_CREDIT', size=15)
plt.title('Violinplot for AMT_CREDIT feature', size=15)
plt.grid(True)
Observations from the plot
## ** AMT_ANNUITY feature distribution plotting**
plt.figure(figsize=(17, 7))
sns.boxplot(x=ap_train_data['AMT_ANNUITY'], palette='rainbow')
plt.xticks(size=11)
plt.yticks(size=11)
plt.xlabel('AMT_ANNUITY', size=15)
plt.title('Boxplot for AMT_ANNUITY feature', size=15)
plt.grid(True)
Observations from the plot
A right-skewed feature with numerous outliers is once again visible.
We can't get rid of these outliers because doing so could mean losing crucial data.
# ** DAYS_EMPLOYED feature distribution plotting**
ap_train_data['DAYS_EMPLOYED'].describe()
count 307511.000000 mean 63815.045904 std 141275.766519 min -17912.000000 25% -2760.000000 50% -1213.000000 75% -289.000000 max 365243.000000 Name: DAYS_EMPLOYED, dtype: float64
plt.figure(figsize=(15, 7))
sns.distplot(x=ap_train_data['DAYS_EMPLOYED'], color='orange')
plt.xticks(size=12)
plt.yticks(size=12)
plt.xlabel('DAYS_EMPLOYED', size=14)
plt.title('KDE plot for DAYS_EMPLOYED feature', size=16)
plt.grid(True)
Observations from the plot
The negative days values for this feature are the same as those for DAYS_BIRTH.
But, we notice a strange anomaly in this situation: the maximum number of days that can be worked is 365243, or a thousand years.
We won't analyze teh anomaly we'll just ignore it and go ahead and examine the feature distribution once more.
# **Distribution of DAYS_EMPLOYED feature after removing the inconsistent value**
days_employed_fixed = ap_train_data['DAYS_EMPLOYED']
days_employed_fixed = days_employed_fixed[days_employed_fixed<365243]
plt.figure(figsize=(17, 7))
sns.distplot(x=days_employed_fixed, color='blue')
plt.xticks(size=11)
plt.yticks(size=11)
plt.xlabel('DAYS_EMPLOYED', size=15)
plt.title('KDE plot for DAYS_EMPLOYED feature after fixing', size=15)
plt.grid(True)
Observations from the plot
# **Distribution of AMT_GOODS_PRICE feature**
plt.figure(figsize=(17, 7))
sns.distplot(x=ap_train_data['AMT_GOODS_PRICE'], color='red')
plt.xticks(size=11)
plt.yticks(size=11)
plt.xlabel('AMT_GOODS_PRICE', size=15)
plt.title('KDE plot for AMT_GOODS_PRICE feature', size=15)
plt.grid(True)
Observations from the plot
We see yet another multi-modal skewed distribution.
Binning might make better use of this functionality.
# Fixing the issues with DAYS_EMPLOYES and DAYS_BIRTH features
ap_train_data['DAYS_BIRTH'] = ap_train_data['DAYS_BIRTH'] / -1
ap_test_data['DAYS_BIRTH'] = ap_test_data['DAYS_BIRTH'] / -1
ap_train_data['DAYS_EMPLOYED'] = ap_train_data['DAYS_EMPLOYED'][ap_train_data['DAYS_EMPLOYED']<365243]
ap_test_data['DAYS_EMPLOYED'] = ap_test_data['DAYS_EMPLOYED'][ap_test_data['DAYS_EMPLOYED']<365243]
ap_train_data['DAYS_EMPLOYED'] = ap_train_data['DAYS_EMPLOYED']/-1
ap_test_data['DAYS_EMPLOYED'] = ap_test_data['DAYS_EMPLOYED']/-1
ap_train_data['DAYS_EMPLOYED'].head()
0 637.0 1 1188.0 2 225.0 3 3039.0 4 3038.0 Name: DAYS_EMPLOYED, dtype: float64
ap_test_data['DAYS_EMPLOYED'].head()
0 2329.0 1 4469.0 2 4458.0 3 1866.0 4 2191.0 Name: DAYS_EMPLOYED, dtype: float64
ap_train_data['DAYS_BIRTH'].head()
0 9461.0 1 16765.0 2 19046.0 3 19005.0 4 19932.0 Name: DAYS_BIRTH, dtype: float64
ap_test_data['DAYS_BIRTH'].head()
0 19241.0 1 18064.0 2 20038.0 3 13976.0 4 13040.0 Name: DAYS_BIRTH, dtype: float64
list(datasets.keys())
['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance']
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
# is there an overlap between the test and train customers
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
#
datasets["application_test"].shape
(48744, 121)
datasets["application_train"].shape
(307511, 122)
The persons in the kaggle submission file have had previous applications in the previous_application.csv. 47,800 out 48,744 people have had previous appications.
appsDF = datasets["previous_application"]
display(appsDF.head())
print(f"{appsDF.shape[0]:,} rows, {appsDF.shape[1]:,} columns")
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
1,670,214 rows, 37 columns
print(f"There are {appsDF.shape[0]:,} previous applications")
There are 1,670,214 previous applications
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_train"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 291,057
#Find the intersection of two arrays.
print(f'Number of train applicants with previous applications is {len(np.intersect1d(datasets["previous_application"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])):,}')
Number of train applicants with previous applications is 47,800
# How many previous applciations per applicant in the previous_application
prevAppCounts = appsDF['SK_ID_CURR'].value_counts(dropna=False)
len(prevAppCounts[prevAppCounts >40]) #more that 40 previous applications
plt.hist(prevAppCounts[prevAppCounts>=0], bins=100)
plt.grid()
prevAppCounts[prevAppCounts >50].plot(kind='bar')
plt.xticks(rotation=50)
plt.show()
sum(appsDF['SK_ID_CURR'].value_counts()==1)
60458
plt.hist(appsDF['SK_ID_CURR'].value_counts(), cumulative =True, bins = 100);
plt.grid()
plt.ylabel('cumulative number of IDs')
plt.xlabel('Number of previous applications per ID')
plt.title('Histogram of Number of previous applications for an ID')
Text(0.5, 1.0, 'Histogram of Number of previous applications for an ID')
* Low = <5 claims (22%)
* Medium = 10 to 39 claims (58%)
* High = 40 or more claims (20%)
apps_all = appsDF['SK_ID_CURR'].nunique()
apps_5plus = appsDF['SK_ID_CURR'].value_counts()>=5
apps_40plus = appsDF['SK_ID_CURR'].value_counts()>=40
print('Percentage with 10 or more previous apps:', np.round(100.*(sum(apps_5plus)/apps_all),5))
print('Percentage with 40 or more previous apps:', np.round(100.*(sum(apps_40plus)/apps_all),5))
Percentage with 10 or more previous apps: 41.76895 Percentage with 40 or more previous apps: 0.03453
plt.figure(figsize=(8, 8))
plt.pie(x=appsDF['NAME_CONTRACT_STATUS'].value_counts(),
radius=1.3-0.3,
labels=appsDF['NAME_CONTRACT_STATUS'].value_counts().index,
autopct='%1.1f%%',
colors=['crimson', 'yellow', 'blue', 'palegreen'],
wedgeprops={"edgecolor":"0", "width":0.5},
startangle=160,
shadow=True,
textprops={'fontsize': 15})
plt.ylabel('', fontsize=145)
plt.title("Applicant's Previous Contract Status ", fontsize=15)
plt.show()
Observation from above
The majority of the applicants had their contracts approved in prior applications.
36% of candidates had their contracts refused or terminated, and the remaining 1.6% never used their contracts.
plt.figure(figsize=(8, 8))
plt.pie(x=appsDF['NAME_CLIENT_TYPE'].value_counts(),
radius=1.3-0.3,
labels=appsDF['NAME_CLIENT_TYPE'].value_counts().index,
autopct='%1.1f%%',
colors=['red', 'blue', 'green', 'blue'],
explode=[0.2,0,0.2,0],
wedgeprops={"edgecolor":"0", "width":0.5},
startangle=160,
shadow=True,
textprops={'fontsize': 15})
plt.ylabel('', fontsize=15)
plt.title("Type of Client in Previous Application", fontsize=15)
plt.show()
Observation from above
plt.figure(figsize=(14, 7))
sns.countplot(x='CHANNEL_TYPE', data=appsDF, palette='rainbow')
plt.xlabel('Channel Type', fontsize=15)
plt.ylabel('Count', fontsize=15)
plt.xticks(fontsize=11, rotation=55)
plt.yticks(fontsize=11)
plt.grid(True)
plt.title("Applicant's channel in previous application", fontsize=15)
plt.plot();
Observation from above
#Feature Engineering
#**Performing Encoding on Application train and Application Test's Categorical Features**
# Label Encoding
# Create a label encoder object
le = LabelEncoder()
le_count = 0
# Iterating through the columns
for col in ap_train_data:
if ap_train_data[col].dtype == 'object':
# If two or fewer unique categories
if len(list(ap_train_data[col].unique())) <= 2:
# Train on the training data
le.fit(ap_train_data[col])
# Transforming both training and testing data
ap_train_data[col] = le.transform(ap_train_data[col])
ap_test_data[col] = le.transform(ap_test_data[col])
# Tracking of how many columns were label encoded
le_count += 1
print('%d columns were label encoded.' % le_count)
3 columns were label encoded.
# one-hot encoding of features
ap_train_data = pd.get_dummies(ap_train_data)
ap_test_data = pd.get_dummies(ap_test_data)
print('Training Features shape: ', ap_train_data.shape)
print('Testing Features shape: ', ap_test_data.shape)
Training Features shape: (307511, 243) Testing Features shape: (48744, 239)
training_labels = ap_train_data['TARGET']
# Aligning the training and testing data, keeping only columns present in both dataframes
ap_train_data, ap_test_data = ap_train_data.align(ap_test_data, join = 'inner', axis = 1)
# Add the target back in
ap_train_data['TARGET'] = training_labels
print('Training Features shape: ', ap_train_data.shape)
print('Testing Features shape: ', ap_test_data.shape)
Training Features shape: (307511, 240) Testing Features shape: (48744, 239)
ap_train_data.to_csv('ap_train.csv', index=False)
ap_test_data.to_csv('ap_test.csv', index=False)
# Modeling and Baseline
#Numerical features used:
print(*list(numerical_features), sep = ",")
SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,DAYS_REGISTRATION,DAYS_ID_PUBLISH,OWN_CAR_AGE,FLAG_MOBIL,FLAG_EMP_PHONE,FLAG_WORK_PHONE,FLAG_CONT_MOBILE,FLAG_PHONE,FLAG_EMAIL,CNT_FAM_MEMBERS,REGION_RATING_CLIENT,REGION_RATING_CLIENT_W_CITY,HOUR_APPR_PROCESS_START,REG_REGION_NOT_LIVE_REGION,REG_REGION_NOT_WORK_REGION,LIVE_REGION_NOT_WORK_REGION,REG_CITY_NOT_LIVE_CITY,REG_CITY_NOT_WORK_CITY,LIVE_CITY_NOT_WORK_CITY,EXT_SOURCE_1,EXT_SOURCE_2,EXT_SOURCE_3,APARTMENTS_AVG,BASEMENTAREA_AVG,YEARS_BEGINEXPLUATATION_AVG,YEARS_BUILD_AVG,COMMONAREA_AVG,ELEVATORS_AVG,ENTRANCES_AVG,FLOORSMAX_AVG,FLOORSMIN_AVG,LANDAREA_AVG,LIVINGAPARTMENTS_AVG,LIVINGAREA_AVG,NONLIVINGAPARTMENTS_AVG,NONLIVINGAREA_AVG,APARTMENTS_MODE,BASEMENTAREA_MODE,YEARS_BEGINEXPLUATATION_MODE,YEARS_BUILD_MODE,COMMONAREA_MODE,ELEVATORS_MODE,ENTRANCES_MODE,FLOORSMAX_MODE,FLOORSMIN_MODE,LANDAREA_MODE,LIVINGAPARTMENTS_MODE,LIVINGAREA_MODE,NONLIVINGAPARTMENTS_MODE,NONLIVINGAREA_MODE,APARTMENTS_MEDI,BASEMENTAREA_MEDI,YEARS_BEGINEXPLUATATION_MEDI,YEARS_BUILD_MEDI,COMMONAREA_MEDI,ELEVATORS_MEDI,ENTRANCES_MEDI,FLOORSMAX_MEDI,FLOORSMIN_MEDI,LANDAREA_MEDI,LIVINGAPARTMENTS_MEDI,LIVINGAREA_MEDI,NONLIVINGAPARTMENTS_MEDI,NONLIVINGAREA_MEDI,TOTALAREA_MODE,OBS_30_CNT_SOCIAL_CIRCLE,DEF_30_CNT_SOCIAL_CIRCLE,OBS_60_CNT_SOCIAL_CIRCLE,DEF_60_CNT_SOCIAL_CIRCLE,DAYS_LAST_PHONE_CHANGE,FLAG_DOCUMENT_2,FLAG_DOCUMENT_3,FLAG_DOCUMENT_4,FLAG_DOCUMENT_5,FLAG_DOCUMENT_6,FLAG_DOCUMENT_7,FLAG_DOCUMENT_8,FLAG_DOCUMENT_9,FLAG_DOCUMENT_10,FLAG_DOCUMENT_11,FLAG_DOCUMENT_12,FLAG_DOCUMENT_13,FLAG_DOCUMENT_14,FLAG_DOCUMENT_15,FLAG_DOCUMENT_16,FLAG_DOCUMENT_17,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
# Categorical Features used:
print(*list(categorical_features), sep = ",")
NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,NAME_TYPE_SUITE,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,OCCUPATION_TYPE,WEEKDAY_APPR_PROCESS_START,ORGANIZATION_TYPE,FONDKAPREMONT_MODE,HOUSETYPE_MODE,WALLSMATERIAL_MODE,EMERGENCYSTATE_MODE
In the case of the HCDR competition (and many other machine learning problems that involve multiple tables in 3NF or not) we need to join these datasets (denormalize) when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate type features or meta data about the loan or its application. How can we do this when using Machine Learning Pipelines?
previous_application with application_x¶We refer to the application_train data (and also application_test data also) as the primary table and the other files as the secondary tables (e.g., previous_application dataset). All tables can be joined using the primary key SK_ID_PREV.
Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:
AMT_APPLICATION, AMT_CREDIT could be based on average, min, max, median, etc.To build such features, we need to join the application_train data (and also application_test data also) with the 'previous_application' dataset (and the other available datasets).
When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:
application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]I want you to think about this section and build on this.
application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset)), thereby leading to X_train, y_train, X_valid, etc.df = pd.DataFrame([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[np.nan, np.nan, np.nan]],
columns=['A', 'B', 'C'])
display(df)
| A | B | C | |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 4.0 | 5.0 | 6.0 |
| 2 | 7.0 | 8.0 | 9.0 |
| 3 | NaN | NaN | NaN |
df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
# A B
#max NaN 8.0
#min 1.0 2.0
#sum 12.0 NaN
| A | B | |
|---|---|---|
| sum | 12.0 | NaN |
| min | 1.0 | 2.0 |
| max | NaN | 8.0 |
df = pd.DataFrame({'A': [1, 1, 2, 2],
'B': [1, 2, 3, 4],
'C': np.random.randn(4)})
display(df)
| A | B | C | |
|---|---|---|---|
| 0 | 1 | 1 | 0.586963 |
| 1 | 1 | 2 | -0.368022 |
| 2 | 2 | 3 | -0.224149 |
| 3 | 2 | 4 | -0.355347 |
# group by column A:
df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
# B C
# min max sum
#A
#1 1 2 0.590716
#2 3 4 0.704907
| B | C | ||
|---|---|---|---|
| min | max | sum | |
| A | |||
| 1 | 1 | 2 | 0.218941 |
| 2 | 3 | 4 | -0.579496 |
appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
funcs = ["a","b","c"]
{f:f"{f}_max" for f in funcs}
{'a': 'a_max', 'b': 'b_max', 'c': 'c_max'}
So far, both our boolean selections have involved a single condition. You can, of course, have as many conditions as you would like. To do so, you will need to combine your boolean expressions using the three logical operators and, or and not.
Use &, | , ~ Although Python uses the syntax and, or, and not, these will not work when testing multiple conditions with pandas. The details of why are explained here.
You must use the following operators with pandas:
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 37 columns
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704)]["AMT_CREDIT"]
6 0.0 Name: AMT_CREDIT, dtype: float64
appsDF[0:50][(appsDF["SK_ID_CURR"]==175704) & ~(appsDF["AMT_CREDIT"]==1.0)]
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 2315218 | 175704 | Cash loans | NaN | 0.0 | 0.0 | NaN | NaN | TUESDAY | 11 | ... | XNA | NaN | XNA | Cash | NaN | NaN | NaN | NaN | NaN | NaN |
1 rows × 37 columns
appsDF.isna().sum()
SK_ID_PREV 0 SK_ID_CURR 0 NAME_CONTRACT_TYPE 0 AMT_ANNUITY 372235 AMT_APPLICATION 0 AMT_CREDIT 1 AMT_DOWN_PAYMENT 895844 AMT_GOODS_PRICE 385515 WEEKDAY_APPR_PROCESS_START 0 HOUR_APPR_PROCESS_START 0 FLAG_LAST_APPL_PER_CONTRACT 0 NFLAG_LAST_APPL_IN_DAY 0 RATE_DOWN_PAYMENT 895844 RATE_INTEREST_PRIMARY 1664263 RATE_INTEREST_PRIVILEGED 1664263 NAME_CASH_LOAN_PURPOSE 0 NAME_CONTRACT_STATUS 0 DAYS_DECISION 0 NAME_PAYMENT_TYPE 0 CODE_REJECT_REASON 0 NAME_TYPE_SUITE 820405 NAME_CLIENT_TYPE 0 NAME_GOODS_CATEGORY 0 NAME_PORTFOLIO 0 NAME_PRODUCT_TYPE 0 CHANNEL_TYPE 0 SELLERPLACE_AREA 0 NAME_SELLER_INDUSTRY 0 CNT_PAYMENT 372230 NAME_YIELD_GROUP 0 PRODUCT_COMBINATION 346 DAYS_FIRST_DRAWING 673065 DAYS_FIRST_DUE 673065 DAYS_LAST_DUE_1ST_VERSION 673065 DAYS_LAST_DUE 673065 DAYS_TERMINATION 673065 NFLAG_INSURED_ON_APPROVAL 673065 dtype: int64
appsDF.columns
Index(['SK_ID_PREV', 'SK_ID_CURR', 'NAME_CONTRACT_TYPE', 'AMT_ANNUITY',
'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
'RATE_INTEREST_PRIVILEGED', 'NAME_CASH_LOAN_PURPOSE',
'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL'],
dtype='object')
# appsDF[agg_op_features].head()
agg_ops
['min', 'max', 'mean']
The groupby output will have an index or multi-index on rows corresponding to your chosen grouping variables. To avoid setting this index, pass “as_index=False” to the groupby operation.
import pandas as pd
import dateutil
# Load data from csv file
data = pd.DataFrame.from_csv('phone_data.csv')
# Convert date from string to date times
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
data.groupby('month', as_index=False).agg({"duration": "sum"})
Pandas reset_index() to convert Multi-Index to Columns
We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.
Since we have both the variable name and the operation performed in two rows in the Multi-Index dataframe, we can use that and name our new columns correctly.
For more details unstacking groupby results and examples please see here
For more details and examples please see here
features = ['AMT_ANNUITY', 'AMT_APPLICATION']
print(f"{appsDF[features].describe()}")
agg_ops = ["min", "max", "mean"]
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg("mean") #group by ID
display(result.head())
print("-"*50)
result = appsDF.groupby(["SK_ID_CURR"], as_index=False).agg({'AMT_ANNUITY' : agg_ops, 'AMT_APPLICATION' : agg_ops})
result.columns = result.columns.map('_'.join)
display(result)
result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
print(f"result.shape: {result.shape}")
result[0:10]
AMT_ANNUITY AMT_APPLICATION count 1.297979e+06 1.670214e+06 mean 1.595512e+04 1.752339e+05 std 1.478214e+04 2.927798e+05 min 0.000000e+00 0.000000e+00 25% 6.321780e+03 1.872000e+04 50% 1.125000e+04 7.104600e+04 75% 2.065842e+04 1.803600e+05 max 4.180581e+05 6.905160e+06
| SK_ID_CURR | SK_ID_PREV | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | HOUR_APPR_PROCESS_START | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | ... | RATE_INTEREST_PRIVILEGED | DAYS_DECISION | SELLERPLACE_AREA | CNT_PAYMENT | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 1.369693e+06 | 3951.000 | 24835.50 | 23787.00 | 2520.0 | 24835.5 | 13.000000 | 1.0 | 0.104326 | ... | NaN | -1740.0 | 23.0 | 8.0 | 365243.0 | -1709.000000 | -1499.000000 | -1619.000000 | -1612.000000 | 0.000000 |
| 1 | 100002 | 1.038818e+06 | 9251.775 | 179055.00 | 179055.00 | 0.0 | 179055.0 | 9.000000 | 1.0 | 0.000000 | ... | NaN | -606.0 | 500.0 | 24.0 | 365243.0 | -565.000000 | 125.000000 | -25.000000 | -17.000000 | 0.000000 |
| 2 | 100003 | 2.281150e+06 | 56553.990 | 435436.50 | 484191.00 | 3442.5 | 435436.5 | 14.666667 | 1.0 | 0.050030 | ... | NaN | -1305.0 | 533.0 | 10.0 | 365243.0 | -1274.333333 | -1004.333333 | -1054.333333 | -1047.333333 | 0.666667 |
| 3 | 100004 | 1.564014e+06 | 5357.250 | 24282.00 | 20106.00 | 4860.0 | 24282.0 | 5.000000 | 1.0 | 0.212008 | ... | NaN | -815.0 | 30.0 | 4.0 | 365243.0 | -784.000000 | -694.000000 | -724.000000 | -714.000000 | 0.000000 |
| 4 | 100005 | 2.176837e+06 | 4813.200 | 22308.75 | 20076.75 | 4464.0 | 44617.5 | 10.500000 | 1.0 | 0.108964 | ... | NaN | -536.0 | 18.0 | 12.0 | 365243.0 | -706.000000 | -376.000000 | -466.000000 | -460.000000 | 0.000000 |
5 rows × 21 columns
--------------------------------------------------
| SK_ID_CURR_ | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | |
|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 338852 | 456251 | 6605.910 | 6605.910 | 6605.910000 | 40455.0 | 40455.0 | 40455.000 |
| 338853 | 456252 | 10074.465 | 10074.465 | 10074.465000 | 57595.5 | 57595.5 | 57595.500 |
| 338854 | 456253 | 3973.095 | 5567.715 | 4770.405000 | 19413.0 | 28912.5 | 24162.750 |
| 338855 | 456254 | 2296.440 | 19065.825 | 10681.132500 | 18846.0 | 223789.5 | 121317.750 |
| 338856 | 456255 | 2250.000 | 54022.140 | 20775.391875 | 45000.0 | 1170000.0 | 362770.875 |
338857 rows × 7 columns
result.shape: (338857, 8)
| SK_ID_CURR_ | AMT_ANNUITY_min | AMT_ANNUITY_max | AMT_ANNUITY_mean | AMT_APPLICATION_min | AMT_APPLICATION_max | AMT_APPLICATION_mean | range_AMT_APPLICATION | |
|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 3951.000 | 3951.000 | 3951.000000 | 24835.5 | 24835.5 | 24835.500000 | 0.0 |
| 1 | 100002 | 9251.775 | 9251.775 | 9251.775000 | 179055.0 | 179055.0 | 179055.000000 | 0.0 |
| 2 | 100003 | 6737.310 | 98356.995 | 56553.990000 | 68809.5 | 900000.0 | 435436.500000 | 831190.5 |
| 3 | 100004 | 5357.250 | 5357.250 | 5357.250000 | 24282.0 | 24282.0 | 24282.000000 | 0.0 |
| 4 | 100005 | 4813.200 | 4813.200 | 4813.200000 | 0.0 | 44617.5 | 22308.750000 | 44617.5 |
| 5 | 100006 | 2482.920 | 39954.510 | 23651.175000 | 0.0 | 688500.0 | 272203.260000 | 688500.0 |
| 6 | 100007 | 1834.290 | 22678.785 | 12278.805000 | 17176.5 | 247500.0 | 150530.250000 | 230323.5 |
| 7 | 100008 | 8019.090 | 25309.575 | 15839.696250 | 0.0 | 450000.0 | 155701.800000 | 450000.0 |
| 8 | 100009 | 7435.845 | 17341.605 | 10051.412143 | 40455.0 | 110160.0 | 76741.714286 | 69705.0 |
| 9 | 100010 | 27463.410 | 27463.410 | 27463.410000 | 247212.0 | 247212.0 | 247212.000000 | 0.0 |
result.isna().sum()
SK_ID_CURR_ 0 AMT_ANNUITY_min 480 AMT_ANNUITY_max 480 AMT_ANNUITY_mean 480 AMT_APPLICATION_min 0 AMT_APPLICATION_max 0 AMT_APPLICATION_mean 0 range_AMT_APPLICATION 0 dtype: int64
# # Create aggregate features (via pipeline)
# class prevAppsFeaturesAggregater(BaseEstimator, TransformerMixin):
# def __init__(self, features=None): # no *args or **kargs
# self.features = features
# self.agg_op_features = {}
# for f in features:
# self.agg_op_features[f] = {f"{f}_{func}":func for func in ["min", "max", "mean"]}
# def fit(self, X, y=None):
# return self
# def transform(self, X, y=None):
# #from IPython.core.debugger import Pdb as pdb; pdb().set_trace() #breakpoint; dont forget to quit
# result = X.groupby(["SK_ID_CURR"]).agg(self.agg_op_features)
# result.columns = result.columns.droplevel()
# result = result.reset_index(level=["SK_ID_CURR"])
# result['range_AMT_APPLICATION'] = result['AMT_APPLICATION_max'] - result['AMT_APPLICATION_min']
# return result # return dataframe with the join key "SK_ID_CURR"
# from sklearn.pipeline import make_pipeline
# def test_driver_prevAppsFeaturesAggregater(df, features):
# print(f"df.shape: {df.shape}\n")
# print(f"df[{features}][0:5]: \n{df[features][0:5]}")
# test_pipeline = make_pipeline(prevAppsFeaturesAggregater(features))
# return(test_pipeline.fit_transform(df))
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# features = ['AMT_ANNUITY',
# 'AMT_APPLICATION', 'AMT_CREDIT', 'AMT_DOWN_PAYMENT', 'AMT_GOODS_PRICE',
# 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY',
# 'RATE_INTEREST_PRIVILEGED', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
# 'CNT_PAYMENT',
# 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION',
# 'DAYS_LAST_DUE', 'DAYS_TERMINATION']
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# res = test_driver_prevAppsFeaturesAggregater(appsDF, features)
# print(f"HELLO")
# print(f"Test driver: \n{res[0:10]}")
# print(f"input[features][0:10]: \n{appsDF[0:10]}")
# # QUESTION, should we lower case df['OCCUPATION_TYPE'] as Sales staff != 'Sales Staff'? (hint: YES)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.impute import SimpleImputer
# Make a new dataframe for polynomial features
poly_features_df = ap_train_data[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'TARGET']]
poly_features_test_df = ap_test_data[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH']]
# imputer for handling missing values
# from sklearn.preprocessing import Imputer
imputer = SimpleImputer(strategy = 'median')
pn_target = poly_features_df['TARGET']
poly_features_df = poly_features_df.drop(columns = ['TARGET'])
# Need to impute missing values
poly_features_df = imputer.fit_transform(poly_features_df)
poly_features_test_df = imputer.transform(poly_features_test_df)
# Create the polynomial object with specified degree
pn_transformer = PolynomialFeatures(degree = 3)
# Train the polynomial features
pn_transformer.fit(poly_features_df)
# Transform the features
poly_features_df = pn_transformer.transform(poly_features_df)
poly_features_test_df = pn_transformer.transform(poly_features_test_df)
print('Shape of polynomial features: ', poly_features_df.shape)
Shape of polynomial features: (307511, 35)
pn_transformer.get_feature_names(input_features = ['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'DAYS_EMPLOYED'])[:20]
['1', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'EXT_SOURCE_1^2', 'EXT_SOURCE_1 EXT_SOURCE_2', 'EXT_SOURCE_1 EXT_SOURCE_3', 'EXT_SOURCE_1 DAYS_BIRTH', 'EXT_SOURCE_2^2', 'EXT_SOURCE_2 EXT_SOURCE_3', 'EXT_SOURCE_2 DAYS_BIRTH', 'EXT_SOURCE_3^2', 'EXT_SOURCE_3 DAYS_BIRTH', 'DAYS_BIRTH^2', 'EXT_SOURCE_1^3', 'EXT_SOURCE_1^2 EXT_SOURCE_2', 'EXT_SOURCE_1^2 EXT_SOURCE_3', 'EXT_SOURCE_1^2 DAYS_BIRTH', 'EXT_SOURCE_1 EXT_SOURCE_2^2']
# Create a dataframe of the features
poly_features_df = pd.DataFrame(poly_features_df,
columns = pn_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Add in the target
poly_features_df['TARGET'] = pn_target
# Find the correlations with the target
pn_corrs = poly_features_df.corr()['TARGET'].sort_values()
# Display most negative and most positive
print(pn_corrs.head(10))
print(pn_corrs.tail(5))
EXT_SOURCE_2 EXT_SOURCE_3 -0.193939 EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 -0.189605 EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH -0.181283 EXT_SOURCE_2^2 EXT_SOURCE_3 -0.176428 EXT_SOURCE_2 EXT_SOURCE_3^2 -0.172282 EXT_SOURCE_1 EXT_SOURCE_2 -0.166625 EXT_SOURCE_1 EXT_SOURCE_3 -0.164065 EXT_SOURCE_2 -0.160295 EXT_SOURCE_2 DAYS_BIRTH -0.156873 EXT_SOURCE_1 EXT_SOURCE_2^2 -0.156867 Name: TARGET, dtype: float64 DAYS_BIRTH -0.078239 DAYS_BIRTH^2 -0.076672 DAYS_BIRTH^3 -0.074273 TARGET 1.000000 1 NaN Name: TARGET, dtype: float64
# copying test features into dataframe
poly_features_test_df = pd.DataFrame(poly_features_test_df,
columns = pn_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'DAYS_BIRTH',
'DAYS_EMPLOYED']))
# Merge polynomial features into training dataframe
poly_features_df['SK_ID_CURR'] = ap_train_data['SK_ID_CURR']
ap_train_data_poly = ap_train_data.merge(poly_features_df, on = 'SK_ID_CURR', how = 'left')
# Merge polnomial features into testing dataframe
poly_features_test_df['SK_ID_CURR'] = ap_test_data['SK_ID_CURR']
ap_test_data_poly = ap_test_data.merge(poly_features_test_df, on = 'SK_ID_CURR', how = 'left')
# Align the dataframes
ap_train_data_poly, ap_test_data_poly = ap_train_data_poly.align(ap_test_data_poly, join = 'inner', axis = 1)
# Print out the new shapes
print('Training data with polynomial features shape: ', ap_train_data_poly.shape)
print('Testing data with polynomial features shape: ', ap_test_data_poly.shape)
Training data with polynomial features shape: (307511, 274) Testing data with polynomial features shape: (48744, 274)
ap_train_data_poly['TARGET'] = ap_train_data['TARGET']
ap_train_data_poly.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | ... | EXT_SOURCE_2^2 EXT_SOURCE_3 | EXT_SOURCE_2^2 DAYS_BIRTH | EXT_SOURCE_2 EXT_SOURCE_3^2 | EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH | EXT_SOURCE_2 DAYS_BIRTH^2 | EXT_SOURCE_3^3 | EXT_SOURCE_3^2 DAYS_BIRTH | EXT_SOURCE_3 DAYS_BIRTH^2 | DAYS_BIRTH^3 | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 0 | 0 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | ... | 0.009637 | 654.152107 | 0.005108 | 346.733022 | 2.353667e+07 | 0.002707 | 183.785678 | 1.247560e+07 | 8.468590e+11 | 1 |
| 1 | 100003 | 0 | 0 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | ... | 0.207254 | 6491.237078 | 0.178286 | 5583.975307 | 1.748916e+08 | 0.153368 | 4803.518937 | 1.504475e+08 | 4.712058e+12 | 0 |
| 2 | 100004 | 1 | 1 | 1 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | ... | 0.225464 | 5885.942404 | 0.295894 | 7724.580288 | 2.016572e+08 | 0.388325 | 10137.567875 | 2.646504e+08 | 6.908939e+12 | 0 |
| 3 | 100006 | 0 | 0 | 1 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | ... | 0.226462 | 8040.528832 | 0.186365 | 6616.894625 | 2.349331e+08 | 0.153368 | 5445.325225 | 1.933364e+08 | 6.864416e+12 | 0 |
| 4 | 100007 | 0 | 0 | 1 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | ... | 0.055754 | 2076.117157 | 0.092471 | 3443.335521 | 1.282190e+08 | 0.153368 | 5710.929881 | 2.126570e+08 | 7.918677e+12 | 0 |
5 rows × 275 columns
ap_train_data_poly['TARGET'].head()
0 1 1 0 2 0 3 0 4 0 Name: TARGET, dtype: int64
ap_test_data_poly.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | ... | EXT_SOURCE_2^3 | EXT_SOURCE_2^2 EXT_SOURCE_3 | EXT_SOURCE_2^2 DAYS_BIRTH | EXT_SOURCE_2 EXT_SOURCE_3^2 | EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH | EXT_SOURCE_2 DAYS_BIRTH^2 | EXT_SOURCE_3^3 | EXT_SOURCE_3^2 DAYS_BIRTH | EXT_SOURCE_3 DAYS_BIRTH^2 | DAYS_BIRTH^3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | 0 | 0 | 1 | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | 0.018850 | ... | 0.492392 | 0.099469 | 11997.802403 | 0.020094 | 2423.698322 | 2.923427e+08 | 0.004059 | 489.615795 | 5.905670e+07 | 7.123328e+12 |
| 1 | 100005 | 0 | 0 | 1 | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | 0.035792 | ... | 0.024809 | 0.036829 | 1536.577117 | 0.054673 | 2281.043619 | 9.516956e+07 | 0.081161 | 3386.201665 | 1.412789e+08 | 5.894429e+12 |
| 2 | 100013 | 0 | 1 | 1 | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | 0.019101 | ... | 0.342687 | 0.299203 | 9812.640816 | 0.261238 | 8567.521115 | 2.809794e+08 | 0.228089 | 7480.393855 | 2.453261e+08 | 8.045687e+12 |
| 3 | 100028 | 0 | 0 | 1 | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | 0.026392 | ... | 0.132399 | 0.159163 | 3630.555667 | 0.191336 | 4364.443591 | 9.955450e+07 | 0.230013 | 5246.681115 | 1.196786e+08 | 2.729912e+12 |
| 4 | 100038 | 0 | 1 | 0 | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | 0.010032 | ... | 0.077139 | 0.096997 | 2362.974127 | 0.121968 | 2971.298294 | 7.238455e+07 | 0.153368 | 3736.229463 | 9.101923e+07 | 2.217342e+12 |
5 rows × 274 columns
~3==3
False
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
# features = ['AMT_ANNUITY', 'AMT_APPLICATION']
# prevApps_feature_pipeline = Pipeline([
# ('prevApps_add_features1', prevApps_add_features1()), # add some new features
# ('prevApps_add_features2', prevApps_add_features2()), # add some new features
# ('prevApps_aggregater', prevAppsFeaturesAggregater()), # Aggregate across old and new features
# ])
#X_train= datasets["application_train"] #primary dataset
merged_data_df=ap_train_data_poly#primary dataset
appsDF = datasets["previous_application"] #prev app
merge_all_data = False
# transform all the secondary tables
# 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
# 'previous_application', 'POS_CASH_balance'
if merge_all_data:
prevApps_aggregated = prevApps_feature_pipeline.transform(appsDF)
#'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
# 'previous_application', 'POS_CASH_balance'
# merge primary table and secondary tables using features based on meta data and aggregage stats
if merge_all_data:
# 1. Join/Merge in prevApps Data
X_train = X_train.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# 2. Join/Merge in ...... Data
#X_train = X_train.merge(...._aggregated, how='left', on="SK_ID_CURR")
# 3. Join/Merge in .....Data
#dX_train = X_train.merge(...._aggregated, how='left', on="SK_ID_CURR")
# 4. Join/Merge in Aggregated ...... Data
#X_train = X_train.merge(...._aggregated, how='left', on="SK_ID_CURR")
# .....
merged_data_df.head(10)
| SK_ID_CURR | NAME_CONTRACT_TYPE | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | ... | EXT_SOURCE_2^2 EXT_SOURCE_3 | EXT_SOURCE_2^2 DAYS_BIRTH | EXT_SOURCE_2 EXT_SOURCE_3^2 | EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH | EXT_SOURCE_2 DAYS_BIRTH^2 | EXT_SOURCE_3^3 | EXT_SOURCE_3^2 DAYS_BIRTH | EXT_SOURCE_3 DAYS_BIRTH^2 | DAYS_BIRTH^3 | TARGET | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 0 | 0 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | ... | 0.009637 | 654.152107 | 0.005108 | 346.733022 | 2.353667e+07 | 0.002707 | 183.785678 | 1.247560e+07 | 8.468590e+11 | 1 |
| 1 | 100003 | 0 | 0 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | ... | 0.207254 | 6491.237078 | 0.178286 | 5583.975307 | 1.748916e+08 | 0.153368 | 4803.518937 | 1.504475e+08 | 4.712058e+12 | 0 |
| 2 | 100004 | 1 | 1 | 1 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | ... | 0.225464 | 5885.942404 | 0.295894 | 7724.580288 | 2.016572e+08 | 0.388325 | 10137.567875 | 2.646504e+08 | 6.908939e+12 | 0 |
| 3 | 100006 | 0 | 0 | 1 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | ... | 0.226462 | 8040.528832 | 0.186365 | 6616.894625 | 2.349331e+08 | 0.153368 | 5445.325225 | 1.933364e+08 | 6.864416e+12 | 0 |
| 4 | 100007 | 0 | 0 | 1 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | ... | 0.055754 | 2076.117157 | 0.092471 | 3443.335521 | 1.282190e+08 | 0.153368 | 5710.929881 | 2.126570e+08 | 7.918677e+12 | 0 |
| 5 | 100008 | 0 | 0 | 1 | 0 | 99000.0 | 490495.5 | 27517.5 | 454500.0 | 0.035792 | ... | 0.077948 | 2125.674698 | 0.136703 | 3727.930293 | 1.016616e+08 | 0.239745 | 6537.907365 | 1.782904e+08 | 4.862024e+12 | 0 |
| 6 | 100009 | 0 | 1 | 1 | 1 | 171000.0 | 1560726.0 | 41301.0 | 1395000.0 | 0.035792 | ... | 0.257926 | 7222.093969 | 0.175297 | 4908.432271 | 1.374393e+08 | 0.119139 | 3335.972567 | 9.340938e+07 | 2.615523e+12 | 0 |
| 7 | 100010 | 0 | 1 | 1 | 0 | 360000.0 | 1530000.0 | 42075.0 | 1530000.0 | 0.003122 | ... | 0.275839 | 9617.173847 | 0.208789 | 7279.460485 | 2.537995e+08 | 0.158037 | 5509.991376 | 1.921067e+08 | 6.697829e+12 | 0 |
| 8 | 100011 | 0 | 0 | 1 | 0 | 112500.0 | 1019610.0 | 33826.5 | 913500.0 | 0.018634 | ... | 0.031822 | 850.829793 | 0.116265 | 3108.614159 | 8.311569e+07 | 0.424790 | 11357.714633 | 3.036737e+08 | 8.119389e+12 | 0 |
| 9 | 100012 | 1 | 0 | 1 | 0 | 135000.0 | 405000.0 | 20250.0 | 405000.0 | 0.019689 | ... | 0.298404 | 8066.130508 | 0.213929 | 5782.689256 | 1.563113e+08 | 0.153368 | 4145.667492 | 1.120611e+08 | 3.029114e+12 | 0 |
10 rows × 275 columns
# X_kaggle_test= datasets["application_test"]
# if merge_all_data:
# # 1. Join/Merge in prevApps Data
# X_kaggle_test = X_kaggle_test.merge(prevApps_aggregated, how='left', on='SK_ID_CURR')
# # 2. Join/Merge in ...... Data
# #X_train = X_train.merge(...._aggregated, how='left', on="SK_ID_CURR")
# # 3. Join/Merge in .....Data
# #df_labeled = df_labeled.merge(...._aggregated, how='left', on="SK_ID_CURR")
# # 4. Join/Merge in Aggregated ...... Data
# #df_labeled = df_labeled.merge(...._aggregated, how='left', on="SK_ID_CURR")
# # ......
# approval rate 'NFLAG_INSURED_ON_APPROVAL'
ap_train_data_poly.to_csv('app_train_pn.csv', index=False)
ap_test_data_poly.to_csv('app_test_pn.csv', index=False)
# Convert categorical features to numerical approximations (via pipeline)
class ClaimAttributesAdder(BaseEstimator, TransformerMixin):
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
charlson_idx_dt = {'0': 0, '1-2': 2, '3-4': 4, '5+': 6}
los_dt = {'1 day': 1, '2 days': 2, '3 days': 3, '4 days': 4, '5 days': 5, '6 days': 6,
'1- 2 weeks': 11, '2- 4 weeks': 21, '4- 8 weeks': 42, '26+ weeks': 180}
X['PayDelay'] = X['PayDelay'].apply(lambda x: int(x) if x != '162+' else int(162))
X['DSFS'] = X['DSFS'].apply(lambda x: None if pd.isnull(x) else int(x[0]) + 1)
X['CharlsonIndex'] = X['CharlsonIndex'].apply(lambda x: charlson_idx_dt[x])
X['LengthOfStay'] = X['LengthOfStay'].apply(lambda x: None if pd.isnull(x) else los_dt[x])
return X
Train, validation and Test sets (and the leakage problem we have mentioned previously):
Let's look at a small usecase to tell us how to deal with this:
ValueError. This is because the there are new, previously unseen unique values in the test set and the encoder doesn’t know how to handle these values. In order to use both the transformed training and test sets in machine learning algorithms, we need them to have the same number of columns.This last problem can be solved by using the option handle_unknown='ignore'of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.
Here is a example that in action:
# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
Please this blog for more details of OHE when the validation/test have previously unseen unique values.
# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
# Identify the numeric features we wish to consider.
num_attribs =[
'AMT_INCOME_TOTAL',
'AMT_CREDIT',
'EXT_SOURCE_3_x',
'EXT_SOURCE_2_x',
'EXT_SOURCE_1_x',
'EXT_SOURCE_3_y',
'EXT_SOURCE_2_y',
'EXT_SOURCE_1_y',
'DAYS_EMPLOYED',
'FLOORSMAX_AVG',
'FLOORSMAX_MEDI',
'FLOORSMAX_MODE',
'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE',
'ELEVATORS_AVG',
'REG_CITY_NOT_LIVE_CITY',
'FLAG_EMP_PHONE',
'REG_CITY_NOT_WORK_CITY',
'DAYS_ID_PUBLISH',
'DAYS_LAST_PHONE_CHANGE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY',
'AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT',
'AMT_REQ_CREDIT_BUREAU_YEAR',
'1',
'EXT_SOURCE_1_y',
'EXT_SOURCE_2_y',
'EXT_SOURCE_3_y',
'DAYS_BIRTH_y',
'EXT_SOURCE_1^2',
'EXT_SOURCE_1 EXT_SOURCE_2',
'EXT_SOURCE_1 EXT_SOURCE_3',
'EXT_SOURCE_1 DAYS_BIRTH',
'EXT_SOURCE_2^2',
'EXT_SOURCE_2 EXT_SOURCE_3',
'EXT_SOURCE_2 DAYS_BIRTH',
'EXT_SOURCE_3^2',
'EXT_SOURCE_3 DAYS_BIRTH',
'DAYS_BIRTH^2',
'EXT_SOURCE_1^3',
'EXT_SOURCE_1^2 EXT_SOURCE_2',
'EXT_SOURCE_1^2 EXT_SOURCE_3',
'EXT_SOURCE_1^2 DAYS_BIRTH',
'EXT_SOURCE_1 EXT_SOURCE_2^2',
'EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3',
'EXT_SOURCE_1 EXT_SOURCE_2 DAYS_BIRTH',
'EXT_SOURCE_1 EXT_SOURCE_3^2',
'EXT_SOURCE_1 EXT_SOURCE_3 DAYS_BIRTH',
'EXT_SOURCE_1 DAYS_BIRTH^2',
'EXT_SOURCE_2^3',
'EXT_SOURCE_2^2 EXT_SOURCE_3',
'EXT_SOURCE_2^2 DAYS_BIRTH',
'EXT_SOURCE_2 EXT_SOURCE_3^2',
'EXT_SOURCE_2 EXT_SOURCE_3 DAYS_BIRTH',
'EXT_SOURCE_2 DAYS_BIRTH^2',
'EXT_SOURCE_3^3',
'EXT_SOURCE_3^2 DAYS_BIRTH',
'EXT_SOURCE_3 DAYS_BIRTH^2',
'DAYS_BIRTH^3']
num_pipeline = Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', SimpleImputer(strategy='median')),
('std_scaler', StandardScaler()),
])
# Identify the categorical features we wish to consider.
cat_attribs = ['FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR']
# Notice handle_unknown="ignore" in OHE which ignore values from the validation/test that
# do NOT occur in the training set
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('imputer', SimpleImputer(strategy='most_frequent')),
# ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_prep_pipeline = FeatureUnion(transformer_list=[
("num_pipeline", num_pipeline),
("cat_pipeline", cat_pipeline),
])
selected_features = numerical_attris + categorical_attris
tot_features = f"{len(selected_features)}: Num:{len(numerical_attris)}, Cat:{len(categorical_attris)}"
tot_features
'67: Num:63, Cat:4'
# # Split the provided training data into training and validationa and test
# # The kaggle evaluation test set has no labels
# #
# from sklearn.model_selection import train_test_split
# use_application_data_ONLY = False #use joined data
# if use_application_data_ONLY:
# # just selected a few features for a baseline experiment
# selected_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
# 'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
# 'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# X_train = datasets["application_train"][selected_features]
# y_train = datasets["application_train"]['TARGET']
# X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
# X_kaggle_test= datasets["application_test"][selected_features]
# # y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
# selected_features = ['AMT_INCOME_TOTAL', 'AMT_CREDIT','DAYS_EMPLOYED','DAYS_BIRTH','EXT_SOURCE_1',
# 'EXT_SOURCE_2','EXT_SOURCE_3','CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE',
# 'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']
# y_train = X_train['TARGET']
# X_train = X_train[selected_features]
# X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15, random_state=42)
# X_kaggle_test= X_kaggle_test[selected_features]
# # y_test = datasets["application_test"]['TARGET'] #why no TARGET?!! (hint: kaggle competition)
# print(f"X train shape: {X_train.shape}")
# print(f"X validation shape: {X_valid.shape}")
# print(f"X test shape: {X_test.shape}")
# print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
# # Split the provided training data into training and validationa and test
# # The kaggle evaluation test set has no labels
X_train = training_dataset[selected_features]
y_train = ap_train_data_poly["TARGET"]
X_kaggle_test = ap_test_data_poly[selected_features]
subsample_rat = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, stratify=y_train,
test_size=subsample_rat, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,stratify=y_train,
test_size=0.15, random_state=42)
print(f"X train shape: {X_train.shape}")
print(f"X validation shape: {X_valid.shape}")
print(f"X test shape: {X_test.shape}")
print(f"X X_kaggle_test shape: {X_kaggle_test.shape}")
X train shape: (182968, 67) X validation shape: (32289, 67) X test shape: (92254, 67) X X_kaggle_test shape: (48744, 67)
list(datasets["application_train"].columns)
['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_16', 'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR']
To get a baseline, we will use some of the features after being preprocessed through the pipeline. The baseline model is a logistic regression model
$$ \underset{\mathbf{\theta}}{\operatorname{argmin}}\left[\text{CXE}\right] = \underset{\mathbf{\theta}}{\operatorname{argmin}} \left[ -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) log\left(1 - \hat{p}^{(i)}\right)\right]} \right] $$def pct(x):
return round(100*x,3)
del expLog
try:
expLog
except NameError:
expLog = pd.DataFrame(columns=["exp_name",
"Model_name",
"Train Acc",
"Valid Acc",
"Test Acc",
"Train F1",
"Valid F1",
"Test F1",
"Train AUC",
"Valid AUC",
"Test AUC",
"Fit Time"
])
expLog
| exp_name | Model_name | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | Test F1 | Train AUC | Valid AUC | Test AUC | Fit Time |
|---|
%%time
np.random.seed(42)
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("linear", LogisticRegression())
])
model = full_pipeline_with_predictor.fit(X_train, y_train)
CPU times: user 15.3 s, sys: 13.4 s, total: 28.6 s Wall time: 5.22 s
from sklearn.metrics import accuracy_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, plot_roc_curve
np.round(accuracy_score(y_train, model.predict(X_train)), 3)
0.919
print('The Accuracy Score for Train Dataset:', np.round(accuracy_score(y_train, model.predict(X_train)), 3))
print('The F1 Score for Train Dataset:', np.round(f1_score(y_train, model.predict(X_train)), 3))
con_train = confusion_matrix(y_train, model.predict(X_train))
con_val = confusion_matrix(y_valid, model.predict(X_valid))
con_test = confusion_matrix(y_test, model.predict(X_test))
plt.figure(figsize=(7,6))
print('The Confusion Matrix for Training Set')
sns.heatmap(con_train, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(7,6))
print('The Confusion Matrix for Validation Set')
sns.heatmap(con_val, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(7,6))
print('The Confusion Matrix for Test Set')
sns.heatmap(con_test, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(9,6))
print('The AUC-ROC for Train Set')
plot_roc_curve(model, X_train, y_train);
plt.show()
plt.figure(figsize=(9,6))
print('The AUC-ROC for Valid Set')
plot_roc_curve(model, X_valid, y_valid);
plt.show()
plt.figure(figsize=(10,8))
print('AUC-ROC for Test Set')
plot_roc_curve(model, X_test, y_test);
plt.show()
The Accuracy Score for Train Dataset: 0.919 The F1 Score for Train Dataset: 0.008 The Confusion Matrix for Training Set
The Confusion Matrix for Validation Set
The Confusion Matrix for Test Set
The AUC-ROC for Train Set
<Figure size 648x432 with 0 Axes>
The AUC-ROC for Valid Set
<Figure size 648x432 with 0 Axes>
AUC-ROC for Test Set
<Figure size 720x576 with 0 Axes>
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
The SkLearn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, which is also denoted by AUC or AUROC. By computing the area under the roc curve, the curve information is summarized in one number.
from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75
#ROC auc score
from sklearn.metrics import roc_auc_score
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
0.7376770013688367
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + [model_name] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
f1_score(y_train, model.predict(X_train)),
f1_score(y_valid, model.predict(X_valid)),
f1_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
fit_time], 4))
expLog
| exp_name | Model_name | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | Test F1 | Train AUC | Valid AUC | Test AUC | Fit Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.73 | 0.7385 | 49.3522 |
| 1 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.73 | 0.7385 | 49.3522 |
#importing requiered libarries
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
import time
expLog.shape[1]
12
clfs = [LogisticRegression(penalty='none'),
LogisticRegression(penalty='l2'),
DecisionTreeClassifier(),
GaussianNB()]
for clf in clfs:
start_time = time.time()
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("model", clf)
])
model_name = "Baseline {}".format(type(full_pipeline_with_predictor['model']).__name__)
model = full_pipeline_with_predictor.fit(X_train, y_train)
fit_time = time.time() - start_time
print('Fit Time for {} is: {} seconds'.format(model_name, fit_time))
exp_name = f"Baseline_{len(selected_features)}_features"
print(model_name)
expLog.loc[len(expLog)] = [f"{exp_name}"] + [model_name] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
f1_score(y_train, model.predict(X_train)),
f1_score(y_valid, model.predict(X_valid)),
f1_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
fit_time], 4))
con_train = confusion_matrix(y_train, model.predict(X_train))
con_val = confusion_matrix(y_valid, model.predict(X_valid))
con_test = confusion_matrix(y_test, model.predict(X_test))
plt.figure(figsize=(7,6))
print('Training Set Confusion Matrix')
sns.heatmap(con_train, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(7,6))
print('Validation Set Confusion matrix')
sns.heatmap(con_val, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(7,6))
print(' Test Set Confusion Matrix')
sns.heatmap(con_test, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(9,8))
print('AUC-ROC for Train Set')
plot_roc_curve(model, X_train, y_train);
plt.show()
plt.figure(figsize=(9,8))
print('AUC-ROC for Valid Set')
plot_roc_curve(model, X_valid, y_valid);
plt.show()
plt.figure(figsize=(9,8))
print('AUC-ROC for Test Set')
plot_roc_curve(model, X_test, y_test);
plt.show()
expLog
Fit Time for Baseline LogisticRegression is: 5.874547004699707 seconds Baseline LogisticRegression Training Set Confusion Matrix
Validation Set Confusion matrix
Test Set Confusion Matrix
AUC-ROC for Train Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 648x576 with 0 Axes>
Fit Time for Baseline LogisticRegression is: 6.347308397293091 seconds Baseline LogisticRegression Training Set Confusion Matrix
Validation Set Confusion matrix
Test Set Confusion Matrix
AUC-ROC for Train Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 648x576 with 0 Axes>
Fit Time for Baseline DecisionTreeClassifier is: 22.407114028930664 seconds Baseline DecisionTreeClassifier Training Set Confusion Matrix
Validation Set Confusion matrix
Test Set Confusion Matrix
AUC-ROC for Train Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 648x576 with 0 Axes>
Fit Time for Baseline GaussianNB is: 1.8760206699371338 seconds Baseline GaussianNB Training Set Confusion Matrix
Validation Set Confusion matrix
Test Set Confusion Matrix
AUC-ROC for Train Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 648x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 648x576 with 0 Axes>
| exp_name | Model_name | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | Test F1 | Train AUC | Valid AUC | Test AUC | Fit Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 49.3522 |
| 1 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 49.3522 |
| 2 | Baseline_67_features | Baseline LogisticRegression | 0.9193 | 0.9194 | 0.9195 | 0.0083 | 0.0076 | 0.0093 | 0.7378 | 0.7299 | 0.7386 | 5.8745 |
| 3 | Baseline_67_features | Baseline LogisticRegression | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 6.3473 |
| 4 | Baseline_67_features | Baseline DecisionTreeClassifier | 1.0000 | 0.8503 | 0.8507 | 1.0000 | 0.1408 | 0.1515 | 1.0000 | 0.5318 | 0.5380 | 22.4071 |
| 5 | Baseline_67_features | Baseline GaussianNB | 0.6791 | 0.6747 | 0.6812 | 0.2447 | 0.2405 | 0.2472 | 0.7229 | 0.7156 | 0.7225 | 1.8760 |
#Random Forest Classifier
clfs = [RandomForestClassifier(), XGBClassifier()]
for clf in clfs:
start_time = time.time()
full_pipeline_with_predictor = Pipeline([
("preparation", data_prep_pipeline),
("model", clf)
])
model_name = "Baseline {}".format(type(full_pipeline_with_predictor['model']).__name__)
model = full_pipeline_with_predictor.fit(X_train, y_train)
fit_time = time.time() - start_time
print('Fit Time for {} is: {} seconds'.format(model_name, fit_time))
exp_name = f"Baseline_{len(selected_features)}_features"
print(model_name)
exp_name = f"Baseline_{len(selected_features)}_features"
expLog.loc[len(expLog)] = [f"{exp_name}"] + [model_name] + list(np.round(
[accuracy_score(y_train, model.predict(X_train)),
accuracy_score(y_valid, model.predict(X_valid)),
accuracy_score(y_test, model.predict(X_test)),
f1_score(y_train, model.predict(X_train)),
f1_score(y_valid, model.predict(X_valid)),
f1_score(y_test, model.predict(X_test)),
roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]),
roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1]),
roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
fit_time], 4))
con_train = confusion_matrix(y_train, model.predict(X_train))
con_val = confusion_matrix(y_valid, model.predict(X_valid))
con_test = confusion_matrix(y_test, model.predict(X_test))
plt.figure(figsize=(8,5))
print('Confusion Matrix for Training Set')
sns.heatmap(con_train, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(8,5))
print('Confusion Matrix for Validation Set')
sns.heatmap(con_val, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(8,5))
print('Confusion Matrix for Test Set')
sns.heatmap(con_test, annot=True, fmt='g',cmap='viridis')
plt.show()
plt.figure(figsize=(10,8))
print('AUC-ROC for Train Set')
plot_roc_curve(model, X_train, y_train);
plt.show()
plt.figure(figsize=(10,8))
print('AUC-ROC for Valid Set')
plot_roc_curve(model, X_valid, y_valid);
plt.show()
plt.figure(figsize=(10,8))
print('AUC-ROC for Test Set')
plot_roc_curve(model, X_test, y_test);
plt.show()
Fit Time for Baseline RandomForestClassifier is: 136.4936501979828 seconds Baseline RandomForestClassifier Confusion Matrix for Training Set
Confusion Matrix for Validation Set
Confusion Matrix for Test Set
AUC-ROC for Train Set
<Figure size 720x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 720x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 720x576 with 0 Axes>
[01:46:23] WARNING: ../src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior. Fit Time for Baseline XGBClassifier is: 51.30178761482239 seconds Baseline XGBClassifier Confusion Matrix for Training Set
Confusion Matrix for Validation Set
Confusion Matrix for Test Set
AUC-ROC for Train Set
<Figure size 720x576 with 0 Axes>
AUC-ROC for Valid Set
<Figure size 720x576 with 0 Axes>
AUC-ROC for Test Set
<Figure size 720x576 with 0 Axes>
#Print the exp log
expLog
| exp_name | Model_name | Train Acc | Valid Acc | Test Acc | Train F1 | Valid F1 | Test F1 | Train AUC | Valid AUC | Test AUC | Fit Time | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 49.3522 |
| 1 | Baseline_67_features | Baseline XGBClassifier | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 49.3522 |
| 2 | Baseline_67_features | Baseline LogisticRegression | 0.9193 | 0.9194 | 0.9195 | 0.0083 | 0.0076 | 0.0093 | 0.7378 | 0.7299 | 0.7386 | 5.8745 |
| 3 | Baseline_67_features | Baseline LogisticRegression | 0.9193 | 0.9193 | 0.9195 | 0.0081 | 0.0069 | 0.0099 | 0.7377 | 0.7300 | 0.7385 | 6.3473 |
| 4 | Baseline_67_features | Baseline DecisionTreeClassifier | 1.0000 | 0.8503 | 0.8507 | 1.0000 | 0.1408 | 0.1515 | 1.0000 | 0.5318 | 0.5380 | 22.4071 |
| 5 | Baseline_67_features | Baseline GaussianNB | 0.6791 | 0.6747 | 0.6812 | 0.2447 | 0.2405 | 0.2472 | 0.7229 | 0.7156 | 0.7225 | 1.8760 |
| 6 | Baseline_67_features | Baseline RandomForestClassifier | 1.0000 | 0.9189 | 0.9190 | 0.9998 | 0.0361 | 0.0336 | 1.0000 | 0.6903 | 0.6980 | 136.4937 |
| 7 | Baseline_67_features | Baseline XGBClassifier | 0.9231 | 0.9185 | 0.9190 | 0.1042 | 0.0366 | 0.0435 | 0.8586 | 0.7290 | 0.7381 | 51.3018 |
For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:
SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
test_class_scores = model.predict_proba(X_kaggle_test)[:, 1]
test_class_scores[0:10]
array([0.02967482, 0.05190437, 0.02902925, 0.02359007, 0.09664427,
0.03307424, 0.04967597, 0.02642321, 0.00492934, 0.08720874],
dtype=float32)
# Submission dataframe
submit_df = datasets["application_test"][['SK_ID_CURR']]
submit_df['TARGET'] = test_class_scores
submit_df.head()
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.029675 |
| 1 | 100005 | 0.051904 |
| 2 | 100013 | 0.029029 |
| 3 | 100028 | 0.023590 |
| 4 | 100038 | 0.096644 |
submit_df.to_csv("submission.csv",index=False)
! kaggle competitions submit -c home-credit-default-risk -f submission.csv -m "baseline submission"
Warning: Looks like you're using an outdated API Version, please consider updating (server 1.5.13 / client 1.5.12) 100%|█████████████████████████████████████████| 879k/879k [00:00<00:00, 989kB/s] Successfully submitted to Home Credit Default Risk
For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:
Please provide an abstract summarizing the work you did (150 words)
Please explain the work you conducted on feature engineering and transformers. Please include code sections when necessary as well as images or any relevant material
Please explain the pipelines you created for this project and how you used them Please include code sections when necessary as well as images or any relevant material
Please present the results of the various experiments that you conducted. The results should be shown in a table or image. Try to include the different details for each experiment.
Please include code sections when necessary as well as images or any relevant material
Discuss & analyze your different experimental results
Please include code sections when necessary as well as images or any relevant material
Please provide a screenshot of your best kaggle submission.
The screenshot should show the different details of the submission and not just the score.
Based on the performance indicators of the models, we finalized the pipeline with polynomial features. Given that the logistic regression is quicker than the other models and provides strong performance metrics, we discovered that these characteristics perform better and decided to use it.
Our main objectives in this phase were to do EDA, design new features, and develop a foundational model for estimating default risk. The two main problems that were encountered were comprehending the association between datasets and extracting the complex features. Upon submission to Kaggle, the baseline RidgeLogistic Regression model generated a ROC-AUC score of 71.49% and a Training ROC-AUC score of roughly 71.54%. Understanding the project requirements in the past. Project plan for a building. Exploratory Data Analysis, figuring out how features relate to each other, creating brand-new features, and building and baseline models.
Read the following: